White Wine Quality Analysis by Richard Haughton

Univariate Plots Section

Basic Structure

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [5] "residual.sugar"       "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide"
##  [9] "density"              "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density             pH       
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0        Min.   :0.9871   Min.   :2.720  
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090  
##  Median :0.04300   Median : 34.00      Median :134.0        Median :0.9937   Median :3.180  
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4        Mean   :0.9940   Mean   :3.188  
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280  
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0        Max.   :1.0390   Max.   :3.820  
##    sulphates         alcohol         quality     
##  Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :1.0800   Max.   :14.20   Max.   :9.000
##                      vars    n    mean      sd  median trimmed     mad  min     max   range skew
## X                       1 4898 2449.50 1414.08 2449.50 2449.50 1815.44 1.00 4898.00 4897.00 0.00
## fixed.acidity           2 4898    6.85    0.84    6.80    6.82    0.74 3.80   14.20   10.40 0.65
## volatile.acidity        3 4898    0.28    0.10    0.26    0.27    0.09 0.08    1.10    1.02 1.58
## citric.acid             4 4898    0.33    0.12    0.32    0.33    0.09 0.00    1.66    1.66 1.28
## residual.sugar          5 4898    6.39    5.07    5.20    5.80    5.34 0.60   65.80   65.20 1.08
## chlorides               6 4898    0.05    0.02    0.04    0.04    0.01 0.01    0.35    0.34 5.02
## free.sulfur.dioxide     7 4898   35.31   17.01   34.00   34.36   16.31 2.00  289.00  287.00 1.41
## total.sulfur.dioxide    8 4898  138.36   42.50  134.00  136.96   43.00 9.00  440.00  431.00 0.39
## density                 9 4898    0.99    0.00    0.99    0.99    0.00 0.99    1.04    0.05 0.98
## pH                     10 4898    3.19    0.15    3.18    3.18    0.15 2.72    3.82    1.10 0.46
## sulphates              11 4898    0.49    0.11    0.47    0.48    0.10 0.22    1.08    0.86 0.98
## alcohol                12 4898   10.51    1.23   10.40   10.43    1.48 8.00   14.20    6.20 0.49
## quality                13 4898    5.88    0.89    6.00    5.85    1.48 3.00    9.00    6.00 0.16
##                      kurtosis    se
## X                       -1.20 20.21
## fixed.acidity            2.17  0.01
## volatile.acidity         5.08  0.00
## citric.acid              6.16  0.00
## residual.sugar           3.46  0.07
## chlorides               37.51  0.00
## free.sulfur.dioxide     11.45  0.24
## total.sulfur.dioxide     0.57  0.61
## density                  9.78  0.00
## pH                       0.53  0.00
## sulphates                1.59  0.00
## alcohol                 -0.70  0.02
## quality                  0.21  0.01
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
## 1 1           7.0             0.27        0.36           20.7     0.045                  45
## 2 2           6.3             0.30        0.34            1.6     0.049                  14
## 3 3           8.1             0.28        0.40            6.9     0.050                  30
## 4 4           7.2             0.23        0.32            8.5     0.058                  47
## 5 5           7.2             0.23        0.32            8.5     0.058                  47
## 6 6           8.1             0.28        0.40            6.9     0.050                  30
##   total.sulfur.dioxide density   pH sulphates alcohol quality
## 1                  170  1.0010 3.00      0.45     8.8       6
## 2                  132  0.9940 3.30      0.49     9.5       6
## 3                   97  0.9951 3.26      0.44    10.1       6
## 4                  186  0.9956 3.19      0.40     9.9       6
## 5                  186  0.9956 3.19      0.40     9.9       6
## 6                   97  0.9951 3.26      0.44    10.1       6
  • 4898 rows, 13 columns - 1 id column (represents a wine under test), 11 features and 1 output variable (quality)
  • All input features are non-negative numerical values (citric.acid has a minimum of 0); the output variable is ordinal, taking values 0 - 10 (though 3 is the min observed and 9 the max)
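
The output above has the shape of R's usual inspection calls. A minimal sketch of how such a summary might be produced, assuming the data sit in a data frame called `wines` (the name that appears in the correlation output later). The three-row frame below is a stand-in built from the first rows printed above, not the real file, and `features` is a name introduced here for illustration.

```r
# Stand-in for the real data: first three rows printed above, four columns only.
wines <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.0, 6.3, 8.1),
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

dim(wines)      # rows and columns
names(wines)    # column names
str(wines)      # type of each column
summary(wines)  # min, quartiles, mean and max per column
head(wines)     # first rows

# Columns other than the id and the output variable are the input features.
features <- setdiff(names(wines), c("X", "quality"))
features
```

The wider table with skew and kurtosis columns looks like the output of `describe()` from the psych package, which is not used in this sketch.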

Histogram individual features

Omitting some features that appeared less interesting for brevity:

  • citric.acid
  • free.sulfur.dioxide
  • sulphates

quality

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

## 
##    L    M    H 
##  183 3655 1060
  • normalish distribution
  • introduced a categorical variable with ‘low <5’, ‘med 5,6’, ‘high >6’, which might be useful later rather than having to compare each quality level
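
One way the three-level variable could be derived is with `cut()`. This is a sketch rather than the original code: the `quality` vector is rebuilt from the frequency table printed above, and the name `quality.category` follows the one used later in the report.

```r
# Rebuild the quality column from the frequency table printed above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Bin into ordered categories: L = below 5, M = 5 or 6, H = above 6.
# Intervals are right-closed by default: (-Inf,4], (4,6], (6,Inf].
quality.category <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("L", "M", "H"),
  ordered_result = TRUE
)

table(quality.category)  # L 183, M 3655, H 1060, as in the second table above
```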

fixed.acidity

##   vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 4898 6.85 0.84    6.8    6.82 0.74 3.8 14.2  10.4 0.65     2.17 0.01

  • nicely normal distribution, few outliers

volatile.acidity

##   vars    n mean  sd median trimmed  mad  min max range skew kurtosis se
## 1    1 4898 0.28 0.1   0.26    0.27 0.09 0.08 1.1  1.02 1.58     5.08  0

  • right skewed
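  • Leptokurtic - excess kurtosis well above 0 (the figure reported here is excess kurtosis, so a normal distribution scores about 0), peaky with thicker tail than normal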
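
To make the skew and kurtosis figures concrete, here is a sketch of simple moment-based versions. The function names are introduced here for illustration, and package implementations may apply small-sample corrections, so values can differ slightly in the last digits. The id column X makes a convenient check because it is an evenly spaced sequence.

```r
# Moment-based skewness and excess kurtosis (no small-sample correction).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3  # subtract 3 so a normal distribution scores 0
}

# The id column X runs from 1 to 4898, i.e. a flat distribution.
id <- 1:4898
skewness(id)         # about 0: symmetric
excess_kurtosis(id)  # about -1.2, in line with the X row of the table above
```

A negative value therefore means flatter than normal (platykurtic) and a positive value means more peaked with heavier tails (leptokurtic).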

residual.sugar

##   vars    n mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 4898 6.39 5.07    5.2     5.8 5.34 0.6 65.8  65.2 1.08     3.46 0.07

  • looks extremely right skewed (very long tail to the right)
  • some absurd outliers well away from the norm
  • log x plot appears bi-modal which if nothing more shows that the long tail accounts for a significant proportion of the data

chlorides

##   vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## 1    1 4898 0.05 0.02   0.04    0.04 0.01 0.01 0.35  0.34 5.02    37.51  0

  • really high kurtosis, so very definitely leptokurtic
  • also right skewed

total.sulfur.dioxide

##   vars    n   mean   sd median trimmed mad min max range skew kurtosis   se
## 1    1 4898 138.36 42.5    134  136.96  43   9 440   431 0.39     0.57 0.61
##    10%    20%    30%    40%    50%    60%    70%    80%    90%    99%   100% 
##  87.00 102.00 113.00 124.00 134.00 147.00 160.00 176.00 195.00 241.03 440.00

  • pretty normal?

density

##   vars    n mean sd median trimmed mad  min  max range skew kurtosis se
## 1    1 4898 0.99  0   0.99    0.99   0 0.99 1.04  0.05 0.98     9.78  0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

  • high kurtosis?

pH

##   vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## 1    1 4898 3.19 0.15   3.18    3.18 0.15 2.72 3.82   1.1 0.46     0.53  0

  • pretty normal

alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
##   vars    n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 4898 10.51 1.23   10.4   10.43 1.48   8 14.2   6.2 0.49     -0.7 0.02

  • bit right skewed
  • Platykurtic given the low (in fact negative) kurtosis

Individual features by quality

Use a variety of plot types to get an indication of whether a given feature varies significantly with quality

Omitting several features that appeared less interesting for brevity:

  • citric.acid
  • free.sulfur.dioxide
  • sulphates

fixed.acidity

  • always take quality 3 and 9 readings with a pinch of salt as there are relatively few of them
  • I’d say no conclusive difference with quality
  • freqpoly suggests lower quality wines have higher fixed.acidity more often

volatile.acidity

  • some evidence that lower quality wines have higher volatile.acidity
  • not too much to show between med and high quality

residual.sugar

  • a suggestion here that high quality wines have lower residual.sugar than average wines
  • not sure how to interpret low quality wines also having lower residual.sugar
  • each category retains the bi-modal(ish) distribution of the whole sample

chlorides

  • looks like a strong(ish) suggestion that higher quality wines have lower chlorides albeit not by much!
  • (almost 50 percent of wines graded 7 or 8 fall below the 25th percentile of wines graded 5 or 6)
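
A comparison of this kind can be computed directly rather than read off a plot. The helper below is a sketch with a hypothetical name (`share_below_quantile`), and the chloride values are made up for illustration, not taken from the data set.

```r
# Share of values in `x` that fall below the p-th quantile of `reference`.
share_below_quantile <- function(x, reference, p = 0.25) {
  threshold <- quantile(reference, probs = p, names = FALSE)
  mean(x < threshold)
}

# Hypothetical chloride readings, for illustration only.
medium <- c(0.040, 0.044, 0.048, 0.052, 0.056)
high   <- c(0.030, 0.034, 0.038, 0.046)

# The 25th percentile of `medium` is 0.044; three of the four `high`
# values lie below it.
share_below_quantile(high, medium)  # 0.75
```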

total.sulfur.dioxide

  • possibly lower levels for higher quality wines?
  • perhaps more accurately the concentration of values for high quality wines is more compact around the low 100’s

density

  • fairly strong suggestion that lower density = better quality

pH

  • weak indication that higher pH = higher quality
  • facet_wrap suggests it’s very weak

alcohol

  • strong suggestion that high alcohol contributes heavily to high quality

Univariate Analysis

What is the structure of your dataset?

See above

What is/are the main feature(s) of interest in your dataset?

  • residual.sugar - Interesting due to its long, thick tail compared to the other features, with some suggestion that it is in fact bi-modal. Some suggestion that higher quality wines have lower levels than average wines

  • alcohol - By far the clearest indicator of quality, with higher alcohol content indicating higher quality. Also the distribution was much more Platykurtic compared with the other input variables which mostly tended to be Leptokurtic

(Suspect there is a relationship between high alcohol and low residual.sugar?)

  • chlorides - Looks like higher quality wines have lower chloride levels

  • density - Looks like higher quality wines have lower density

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • fixed and volatile acidity might be useful to ‘weed out’ low quality wines
  • similar for free.sulfur.dioxide

Did you create any new variables from existing variables in the dataset?

Not exactly but I did create a variation of the output variable - quality.category

Rather than having to consider every possible quality level (0 - 10, of which 7 were observed), I instead used 3 categories (Low, Med, High) to simplify analysis

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

residual.sugar had a long thick tail, so I performed a log 10 transformation which enabled me to get a clearer idea of where the bulk of the data lay.
As well as a large peak around 2(ish) there was a shorter/fatter (but roughly equal in size) peak around 10.
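
A sketch of the log 10 binning described above, assuming nothing about the original plotting code. The sample is synthetic (two log-normal groups centred near 1.5 and 10) and stands in for the real residual.sugar column.

```r
# Hypothetical right-skewed sample with two groups, for illustration only
# (the real residual.sugar column is not reproduced here).
set.seed(42)
residual.sugar <- c(rlnorm(500, meanlog = log(1.5), sdlog = 0.3),
                    rlnorm(500, meanlog = log(10),  sdlog = 0.3))

# Bin on the log10 scale; plot = FALSE returns the counts without drawing.
h <- hist(log10(residual.sugar), breaks = 30, plot = FALSE)

# Bin midpoints, converted back to the original units, with their counts.
data.frame(mid = round(10^h$mids, 2), count = h$counts)
```

On the log scale each bin covers a constant ratio rather than a constant width, which is why a second group hidden in the long tail becomes visible as its own peak.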

Bivariate Plots Section

Bi-variate matrix

Produce a bi-variate matrix to show the correlation/distribution between each pair of features to steer further analysis. Produce a matrix for all the data and then again for just the higher quality wines to see whether (and how) they differ.

This takes ages to produce, so I pre-prepared images (and commented out code):

All Qualities

[Image: bi-variate matrix, all qualities]

High Quality

[Image: bi-variate matrix, high quality wines]

I will use this output to choose which bi and multi variate plots to produce
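
The numeric half of such a matrix can be sketched with `cor()`, which is quick even when the full plot matrix is slow. The data below are synthetic and the column relationships are invented for illustration; `wines` and `highwines` follow the names seen in the correlation output in this report.

```r
# Hypothetical miniature data set, for illustration only.
set.seed(1)
n <- 200
alcohol <- runif(n, 8, 14)
wines <- data.frame(
  alcohol = alcohol,
  density = 1.02 - 0.002 * alcohol + rnorm(n, sd = 0.001),
  quality = pmin(9, pmax(3, round(alcohol / 2 + rnorm(n))))
)

# Pairwise Pearson correlations for every wine.
cor_all <- cor(wines)

# The same matrix restricted to the higher quality wines (quality above 6);
# quality itself is dropped because its range is narrow in this subset.
highwines <- wines[wines$quality > 6, ]
cor_high <- cor(highwines[, c("alcohol", "density")])

round(cor_all, 2)
round(cor_high, 2)
```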

Quality vs Individual Input Features

Scatter plots to compare individual features against quality.

Omitting several features for brevity:

  • fixed.acidity
  • citric.acid
  • free.sulfur.dioxide
  • pH
  • sulphates
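
Each comparison below reports both a Pearson and a Spearman test. A minimal sketch of the two calls on toy vectors (not the wine data) shows why they can disagree: Pearson measures linear association, while Spearman works on ranks.

```r
# Hypothetical toy vectors, for illustration only: y rises with x,
# but not in a straight line.
x <- 1:10
y <- x^2

# Pearson measures linear association.
pearson <- cor.test(x, y, method = "pearson")

# Spearman works on ranks, so any strictly increasing relationship scores 1.
# exact = FALSE requests the asymptotic p-value, which avoids the warning R
# gives when ties (common with an integer score such as quality) rule out
# an exact one.
spearman <- cor.test(x, y, method = "spearman", exact = FALSE)

unname(pearson$estimate)   # a little below 1
unname(spearman$estimate)  # 1
```

A Spearman value noticeably larger in magnitude than the Pearson value, as for chlorides below, is consistent with a monotonic but non-linear relationship or with outliers pulling on the Pearson estimate.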

quality vs volatile.acidity

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$volatile.acidity
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2215214 -0.1676307
## sample estimates:
##       cor 
## -0.194723
## 
##  Spearman's rank correlation rho
## 
## data:  wines$quality and wines$volatile.acidity
## S = 2.3434e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1965617
  • weak negative correlation at best (about -0.19 to -0.20)

quality vs residual.sugar

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$residual.sugar
## t = -6.8603, df = 4896, p-value = 7.724e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.12524103 -0.06976101
## sample estimates:
##         cor 
## -0.09757683
## 
##  Spearman's rank correlation rho
## 
## data:  wines$quality and wines$residual.sugar
## S = 2.1191e+10, p-value = 8.822e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.08206979
  • interesting to see that the median/mean are a lot higher than the highest density residual.sugar values
  • still seeing a reduction for higher quality wines
  • looks like there should be a stronger correlation because there is a drop off from medium to high quality wines. Could it be the low quality wines (values 3 - 4) that are hiding the trend?

quality vs chlorides

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$chlorides
## t = -15.024, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2365501 -0.1830039
## sample estimates:
##        cor 
## -0.2099344
## 
##  Spearman's rank correlation rho
## 
## data:  wines$quality and wines$chlorides
## S = 2.5743e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.3144885
  • moderate correlation (with spearman anyway)
  • possible interesting group of high chlorides for quality = 8

quality vs total.sulfur.dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2017563 -0.1474524
## sample estimates:
##        cor 
## -0.1747372
## 
##  Spearman's rank correlation rho
## 
## data:  wines$quality and wines$total.sulfur.dioxide
## S = 2.3436e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1966803

quality vs density

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233
## 
##  Spearman's rank correlation rho
## 
## data:  wines$quality and wines$density
## S = 2.6406e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.348351
  • fairly clear that most high quality wines have lower density (again we think this is directly because of the low density of alcohol)
  • still looks to me like a small cluster of higher quality wines with high density

quality vs alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747
## 
##  Spearman's rank correlation rho
## 
## data:  wines$quality and wines$alcohol
## S = 1.096e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4403692
  • confirmed as the strongest relationship to quality
  • look at quality = 8, clear air gap between the main cluster of ‘high alcohol’ wines and a smaller cluster of ‘low alcohol’ wines

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

(more on this in the Multivariate section)

  • confirmed a positive correlation between alcohol and quality, the strongest of any feature (Pearson 0.436).

  • surprised by the very small correlation between residual.sugar and quality. Curiously it looks like (on average) levels of residual.sugar start low for low quality wines, rise for medium quality wines and dip again as we move into high quality wines. What if anything does this suggest?

  • it’s beginning to look like there is a recipe for creating higher quality wines that is something like:
    • High alcohol levels ~(11 - 14)
    • Low residual.sugar ~(1 - 3)
    • Low chloride levels ~(0.03 - 0.045)
    • (possibly) Lower levels of total.sulfur.dioxide

(I’m ignoring density because I think it is a direct consequence of levels of the above)

  • there is just a hint that there is an alternative recipe for higher quality wines (difficult to confirm given the relatively small number of samples):
    • Lower alcohol levels ~(8.5 - 9)
    • High levels of residual.sugar ~(14 - 16)
    • Higher levels of chloride ~(0.06)

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • strong +ve correlation between total.sulfur.dioxide and density and corresponding strong -ve correlation between total.sulfur.dioxide and alcohol

  • Surprised that whilst we observe a strong -ve correlation between fixed.acidity and pH (as you might expect) there is little to no relationship between volatile.acidity and pH

What was the strongest relationship you found?

Strongest relationship between a feature and the output variable (quality) was for ‘alcohol’ with (Pearson) correlation of 0.436

Overall the strongest correlation was between residual.sugar and density with (Pearson) correlation of 0.839. Closely followed by alcohol and density (-0.78).
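
Finding the strongest pair can be automated from a correlation matrix. The helper below is a sketch with a hypothetical name (`strongest_pair`), run on made-up data rather than the wine data set.

```r
# Return the pair of columns with the largest absolute Pearson correlation.
strongest_pair <- function(df) {
  m <- cor(df)
  m[upper.tri(m, diag = TRUE)] <- NA  # keep each pair once, drop self-correlations
  idx <- which(abs(m) == max(abs(m), na.rm = TRUE), arr.ind = TRUE)[1, ]
  data.frame(
    var1 = colnames(m)[idx[["col"]]],
    var2 = rownames(m)[idx[["row"]]],
    cor  = m[idx[["row"]], idx[["col"]]]
  )
}

# Hypothetical toy data, for illustration only.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 11),
  c = c(5, 3, 4, 1, 2)
)
strongest_pair(toy)  # a and b, with a correlation just under 1
```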

Multivariate Plots Section

Input Feature vs Input Feature split on quality.category

Look for correlations between input variables, starting with those that seem to have a significant impact upon quality, and split by quality (to provide a third variable and thus a multivariate plot)

chlorides vs alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  wines$chlorides and wines$alcohol
## t = -27.016, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3843183 -0.3355673
## sample estimates:
##        cor 
## -0.3601887
## 
##  Spearman's rank correlation rho
## 
## data:  wines$chlorides and wines$alcohol
## S = 3.0763e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5708064
  • is there a suggestion that ‘low alcohol’ + ‘high chlorides’ = high quality?
  • bit worried that there are only ‘few’ samples of high chlorides, skewing?
  • the pattern though (alcohol increases as chlorides decrease) seems consistent across quality categories

density vs alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  wines$density and wines$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376
## 
##  Spearman's rank correlation rho
## 
## data:  wines$density and wines$alcohol
## S = 3.568e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8218551

residual.sugar vs alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  wines$residual.sugar and wines$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312
## 
##  Spearman's rank correlation rho
## 
## data:  wines$residual.sugar and wines$alcohol
## S = 2.8304e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.4452574
  • again we see a cluster (albeit small) of high quality wines where it looks like:
    • low alcohol + high sugar = high quality
  • hard to see because there are so many medium quality wines, but the density of high quality wines with alcohol 11 - 14 and residual.sugar < 5, combined with the smoothed trend line, illustrates that this is the sweetspot. It is also telling that there are comparatively few low and medium quality wines in this range

volatile.acidity vs alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  highwines$volatile.acidity and highwines$alcohol
## t = 19.179, df = 1058, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4618272 0.5512684
## sample estimates:
##       cor 
## 0.5079155
  • so virtually no correlation for low/med quality wines
  • a strong positive correlation for high quality wines
  • this is the strongest difference in relationship between features across qualities
  • note that a similar treatment of fixed.acidity vs alcohol gave a similar but less extreme result this time with a correlation of -0.3 (note negative) for high quality wines
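
Comparing a correlation across quality categories amounts to computing it once per subset. A sketch, using a hypothetical helper (`cor_by_group`) and toy data constructed so that the two variables move together in group H and are unrelated in group M:

```r
# Pearson correlation between two columns, computed separately per group.
cor_by_group <- function(df, x, y, group) {
  sapply(split(df, df[[group]]), function(part) cor(part[[x]], part[[y]]))
}

# Hypothetical toy data, for illustration only.
toy <- data.frame(
  quality.category = rep(c("M", "H"), each = 4),
  volatile.acidity = c(0.2, 0.4, 0.4, 0.2,   0.2, 0.3, 0.4, 0.5),
  alcohol          = c(9,   9,   11,  11,    10,  11,  12,  13)
)

# Returns one value per group: about 1 for H and about 0 for M.
cor_by_group(toy, "volatile.acidity", "alcohol", "quality.category")
```

With groups of very different sizes, as here (1060 high quality wines against 3655 medium), the confidence intervals differ too, so `cor.test()` per group would give a fuller picture than the point estimates alone.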

density vs chlorides

## 
##  Pearson's product-moment correlation
## 
## data:  wines$density and wines$chlorides
## t = 18.624, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2308679 0.2831779
## sample estimates:
##       cor 
## 0.2572113
## 
##  Spearman's rank correlation rho
## 
## data:  wines$density and wines$chlorides
## S = 9629500000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5083018
  • perhaps more chlorides directly increases density? We certainly see an increase across the quality ranges
  • can we say that density shows alcohol and chlorides reinforcing each other?

residual.sugar vs fixed.acidity

An article discussing the relationship between alcohol and residual.sugar also suggests you need high acidity with high sugar

http://drinks.seriouseats.com/2013/04/wine-jargon-what-is-residual-sugar-riesling-fermentation-steven-grubbs.html

## 
##  Pearson's product-moment correlation
## 
## data:  wines$residual.sugar and wines$fixed.acidity
## t = 6.2537, df = 4896, p-value = 4.348e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06116674 0.11673612
## sample estimates:
##       cor 
## 0.0890207
## 
##  Spearman's rank correlation rho
## 
## data:  wines$residual.sugar and wines$fixed.acidity
## S = 1.7494e+10, p-value = 6.955e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1067249
## 
##  Pearson's product-moment correlation
## 
## data:  highwines$residual.sugar and highwines$fixed.acidity
## t = 8.3967, df = 1058, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1926383 0.3055648
## sample estimates:
##       cor 
## 0.2499514

  • so a much stronger (albeit still smallish), positive correlation for high quality wines
  • when sugar is low, acidity seems to vary widely
  • when sugar is higher, acidity seems to be higher, and there is a smallish but not insignificant cluster of higher quality wines when sugar is higher and acidity is higher

residual.sugar vs volatile.acidity

For completeness I also checked residual.sugar against volatile.acidity; this showed virtually no correlation for any quality

pH vs fixed.acidity

## 
##  Pearson's product-moment correlation
## 
## data:  wines$pH and wines$fixed.acidity
## t = -32.934, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4485154 -0.4026542
## sample estimates:
##        cor 
## -0.4258583
  • strong negative correlation across all quality categories
  • intuitively makes sense given pH is a measure of acidity (with higher acid at lower pH).
  • Apparently though there is no “direct connection between total acidity and pH” http://en.wikipedia.org/wiki/Acids_in_wine

residual.sugar vs total.sulfur.dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  wines$residual.sugar and wines$total.sulfur.dioxide
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3776791 0.4246712
## sample estimates:
##       cor 
## 0.4014393
  • again fairly strong positive correlation for all qualities
  • again we see 2 clusters of high quality
    • major - low res.sug and variable total.sulfur.dioxide
    • minor - high res.sug & high total.sulfur.dioxide

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

chlorides vs alcohol

During the bi-variate section it looked like:

  • Big cluster of high quality wines with ‘high alcohol’
  • similarly, Big cluster of high quality wines with ‘low chlorides’

  • Smaller cluster of high quality wines with ‘low alcohol’
  • similarly, Smaller cluster of high quality wines with ‘high chlorides’

The multivariate plot between chlorides and alcohol reinforces this suspicion; it looks like we see

  • high alcohol + low chlorides = high quality (large cluster of this)
  • low alcohol + high chlorides = high quality (smaller cluster of this)
  • medium alcohol + medium chlorides = medium quality (very large cluster)

residual.sugar vs alcohol

We appear to be seeing similar behaviour for these two:

  • high alcohol + low residual.sugar = high quality (large cluster of this)
  • low alcohol + high residual.sugar = high quality (smaller cluster of this)

It’s less clear whether medium (and low) quality wines fall between these clusters

(please note: the ‘=’ is misleading because of course there are also lower quality wines that follow the same pattern)

Were there any interesting or surprising interactions between features?

  • The strong positive correlation between volatile.acidity and alcohol for higher quality wines was surprising given there was virtually no correlation for lower and medium quality wines

  • Surprised me that when we look at high quality wines only we see a fairly significant +ve relationship between volatile.acidity and alcohol but a -ve correlation between fixed.acidity and alcohol:
    • volatile.acidity vs alcohol = 0.5
    • fixed.acidity vs alcohol = -0.3

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

NO


Final Plots and Summary

The biggest single contributor to the quality of wine is ….. alcohol!

Alcohol is by far the single biggest contributor to the quality of white wine (in this sample). As a crude measure it has a (Pearson) correlation of 0.436 with quality. Setting aside density (-0.307), which I treat as a consequence of the other features, that is over twice the next most significant (chlorides at -0.21)

The density plot shows quite clearly the marked increase in quality as you increase the level of alcohol, with a sweetspot between around 11 and 13.

The distribution of low and medium quality wines is ‘quite similar’ whereas (as expected) the distribution of high quality wines is noticeably right shifted (higher alcohol).

It’s also worth noting that there is a smaller peak of higher quality wines with lower alcohol content (around 9%). So, we have:

  • large cluster of high quality wines with high alcohol (11 - 13)
  • small cluster of high quality wines with low alcohol (~9)
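
The location of a density peak can also be estimated without reading it off the plot. This sketch uses base R's `density()` on synthetic alcohol readings; the group means of 10 and 12 are made up for illustration, and the code finds only the main peak of each category, not a smaller secondary one.

```r
# Hypothetical alcohol readings per category, for illustration only.
set.seed(7)
alcohol <- c(rnorm(300, mean = 10, sd = 0.6),   # stand-in for medium quality
             rnorm(300, mean = 12, sd = 0.6))   # stand-in for high quality
category <- rep(c("M", "H"), each = 300)

# Kernel density estimate per category, and the alcohol level where it peaks.
peak_by_category <- sapply(split(alcohol, category), function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
})

round(peak_by_category, 1)
```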

A similar pattern could be observed for certain other features. For example:

  • residual.sugar
    • large cluster of high quality wines with low residual.sugar
    • small cluster of high quality wines with high residual.sugar
  • chlorides
    • large cluster of high quality wines with low chlorides
    • small cluster of high quality wines with high chlorides

We’ll look at how these contribute together towards the quality of wine in the following plots

The recipe for a high quality wine is …..

Here we plot chlorides against alcohol, facet wrapped by quality, to see whether they reinforce each other. In particular, to see whether the high quality clusters (large and small) outlined earlier still exist when we combine features.

Although it is difficult to be certain, particularly because of the uneven distribution (there are many more medium quality wines), it looks like the two high quality clusters hypothesis just about still holds.

There does appear to be a large clustering of blue (high quality) top right with low chlorides and high alcohol, though it’s not very dense. Similarly there appears to be a smaller cluster bottom right (low alcohol and high chlorides).
Moreover (with a bit of a squint) it looks like the critical mass of low/medium quality wines sit somewhere between the two clusters.

An analogous pattern could be observed if we also plotted the following:

  • residual.sugar vs alcohol
  • residual.sugar vs chlorides

Tentative (very) hypothesis

There are two recipes for a good white wine:

  1. High alcohol + Low residual.sugar + Low chlorides
  2. Low alcohol + High residual.sugar + High chlorides

In actual fact there are plenty of poorer quality wines that follow these recipes but you will increase your probability of having a high quality wine.
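
The claim about increased probability can be checked by comparing the share of high quality wines inside a recipe region with the share overall. The data and the thresholds below are hypothetical, chosen only to show the calculation.

```r
# Share of high quality wines overall and inside a "recipe" region.
# Hypothetical toy data and thresholds, for illustration only.
toy <- data.frame(
  alcohol        = c(12.5, 11.8, 9.0, 10.0, 12.0, 9.5, 10.2, 13.0),
  residual.sugar = c(2.0,  1.5,  8.0, 6.0,  2.5,  7.0, 5.0,  12.0),
  quality        = c(7,    8,    5,   6,    6,    5,   6,    7)
)

is_high   <- toy$quality > 6
in_recipe <- toy$alcohol >= 11 & toy$residual.sugar <= 3

base_rate   <- mean(is_high)             # P(high quality): 3 of 8
recipe_rate <- mean(is_high[in_recipe])  # P(high quality | recipe): 2 of 3

c(base_rate = base_rate, recipe_rate = recipe_rate)
```

A recipe is only informative to the extent that `recipe_rate` exceeds `base_rate` on the real data.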

Volatile.acidity and alcohol impact on quality

When plotting volatile.acidity against alcohol split by quality we see a marked difference in distribution for high quality wines compared to the rest.
We see a fairly strong positive correlation (Pearson 0.5) for high quality wines vs virtually nothing for the rest.

So, getting the balance of volatile.acidity to alcohol level correct might be a further worthwhile consideration when producing/predicting/evaluating white wine.

Reflection

Issues/Difficulties

  1. All the analysis suffers from one big flaw which is my fundamental ignorance of chemistry (subject matter expertise - missing), meaning that observations that seem interesting to me might well be obvious/inevitable and vice versa.

  2. Also this is a relatively small data set and thus subject to large errors and wrong conclusions.

  3. The vast majority of the data had mid-range quality values (5 or 6), which made comparison difficult. In practice I suspect there was more distance between those wines than a difference of just one point would suggest. A finer scale, and/or the individual ratings per reviewer per wine rather than an aggregated score per wine, might have enabled more interesting analysis.

  4. I found it difficult to abstract repeated code into functions (and eventually abandoned the attempt). This was largely due to the fact that for most plots I had to calculate bounds manually, case by case.
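
On the last point, one possible way round hand-picked bounds is to derive them from quantiles, so that a single helper serves every feature. The function names (`trim_limits`, `describe_range`) and the 1%/99% cut-offs are assumptions made for this sketch, not part of the original analysis.

```r
# Axis limits that cut off the extreme tails instead of hand-picked bounds.
trim_limits <- function(x, probs = c(0.01, 0.99)) {
  quantile(x, probs = probs, names = FALSE, na.rm = TRUE)
}

# Reusable summary: limits plus the share of points that fall inside them.
describe_range <- function(x, probs = c(0.01, 0.99)) {
  lim <- trim_limits(x, probs)
  list(limits = lim,
       share_kept = mean(x >= lim[1] & x <= lim[2], na.rm = TRUE))
}

# Toy input, for illustration only: the limits come out at 2 and 100.
describe_range(1:101)
```

The returned limits could then be passed to whichever plotting call is in use as its axis range.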

Next Steps

  1. Moving forwards it would (perhaps) be interesting to apply a clustering model (e.g. K-Means) to the high quality wines to determine whether the two categories do really exist.

  2. Additionally we could look to train a predictive model (e.g. logistic regression or neural network) to predict the quality of wine (perhaps explicitly favouring the selected features)
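
As a rough sketch of the first step, base R's `kmeans()` could be applied to standardised features of the high quality wines. The data below are synthetic, drawn from two made-up groups that mimic the tentative hypothesis, so the result here says nothing about the real wines.

```r
# Hypothetical high quality wines drawn from two made-up groups, for
# illustration only: 80 dry/high alcohol and 20 sweet/low alcohol.
set.seed(123)
highwines <- data.frame(
  alcohol        = c(rnorm(80, 12.5, 0.4), rnorm(20, 9.0, 0.3)),
  residual.sugar = c(rnorm(80, 2.0, 0.5),  rnorm(20, 15.0, 1.0))
)

# Standardise so both features carry equal weight, then fit two clusters.
# nstart = 10 reruns from several random starts and keeps the best fit.
fit <- kmeans(scale(highwines), centers = 2, nstart = 10)

table(fit$cluster)                                        # cluster sizes
aggregate(highwines, list(cluster = fit$cluster), mean)   # per-cluster means
fit$betweenss / fit$totss                                 # share of variance explained
```

K-means always returns the number of clusters it is asked for, so the fit alone does not prove that two groups exist. Comparing the explained share across different values of `centers` would be one way to judge whether two is a natural choice.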